The raw data, the cleaned data, and the R Markdown file are available in the following GitHub repository:
GitHub: https://github.com/xiayunj/Data-Visualization-Airbnb-New-York.git
Airbnb, which launched in 2008, is a platform on which hosts offer homes and spaces for travelers to stay in during their vacations. It is an alternative to traditional hotels: travelers tend to feel more at home, and the quality is usually higher at similar rates. According to the Clever survey “Airbnb’s Impact on the Hotel Industry”, “60% of travelers who use both Airbnb and hotels prefer Airbnb over comparable hotels when going on vacation.” (Source link: https://listwithclever.com/real-estate-blog/airbnb-vs-hotels-study/).
We chose to focus our analysis on New York City’s Airbnb data because it is one of the world’s most famous metropolises and attracts tens of millions of visitors every year. We also currently live in New York and are interested in researching the Airbnb rental market here.
This report examines the Airbnb listings in New York City, as summarised in 2019, from three aspects: room types, prices, and ratings/comments. In the results part, we analyze the NYC listing data in depth along these three aspects and share the insights we gained. In the interactive part, we explore the same three aspects in a more flexible and user-friendly way, to give users a better sense of the questions below and to provide useful information based on these three criteria.
1. Room type - How are the room types distributed across neighborhoods, quantitatively and spatially? Is there any relationship between room type and the number of days a property is available in a year?
2. Price - For the two major room types (entire home/apartment and private room), does price differ across neighborhoods or transportation options?
3. Ratings/comments - Are there any patterns in the distribution of rating scores in each neighborhood? How do users’ emotions about their Airbnb experience change over time?
The data cover Airbnb listings in New York City in 2019. We downloaded them from Inside Airbnb (http://insideairbnb.com/get-the-data.html) and NYC OpenData (https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4). Our original data source consists of 3 files:
Airbnb listings data (about 48,000 records) (listings.csv);
Reviews for each Airbnb listing (about 1 million records) (reviews.csv);
NYC community districts map (Community Districts.geojson).
Dataset 1 (listings) contains 106 variables, with slightly more categorical variables than numeric ones. Dataset 2 (reviews) contains 6 variables, the most important of which for our analysis is the detailed comments. Because dataset 2 has so many records, processing all of them would take a huge amount of computing time, so we draw a random sample from it. Another issue is that both datasets contain missing values, and including missing values of numeric variables in our analysis and exploration would cause calculation errors, so we remove them where necessary. In addition, some listings have a price of zero; these should not be part of the analysis, so we removed them. We also removed the listings with “availability_365” = 0, since these properties are not currently available and there is no point in analyzing them.
For the spatial plots in this project, we also requested a Google Maps API key, and we assigned the individual communities to their corresponding community districts according to the information provided by NYC Planning (https://communityprofiles.planning.nyc.gov/queens/12).
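The cleaning rules described above can be illustrated with a short sketch. The report's own processing was done in R, so this Python version is only an illustration: the function names, the toy records, the fixed seed, and the choice to drop rows whose price or availability is missing are all assumptions, not the authors' code.

```python
import random


def clean_listings(listings):
    """Keep listings with a known, positive price and at least one available day."""
    kept = []
    for row in listings:
        price = row.get("price")
        availability = row.get("availability_365")
        # Assumption: rows missing either numeric field are dropped.
        if price is None or availability is None:
            continue
        # Zero-price and never-available listings are excluded, as in the text.
        if price <= 0 or availability == 0:
            continue
        kept.append(row)
    return kept


def sample_reviews(reviews, fraction=0.10, seed=2019):
    """Draw a reproducible random sample (without replacement) of the reviews."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    sample_size = round(len(reviews) * fraction)
    return rng.sample(reviews, sample_size)


# Toy data standing in for listings.csv and reviews.csv.
listings = [
    {"id": 1, "price": 120.0, "availability_365": 200},
    {"id": 2, "price": 0.0, "availability_365": 150},
    {"id": 3, "price": 80.0, "availability_365": 0},
    {"id": 4, "price": None, "availability_365": 30},
]
kept = clean_listings(listings)

reviews = [f"comment {i}" for i in range(1000)]
sampled = sample_reviews(reviews)

print([row["id"] for row in kept], len(sampled))
```

Fixing the seed makes the sample repeatable, which helps when the downstream text analysis has to be rerun.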
The two datasets we downloaded are CSV files, so we imported them into R using read_csv().
To register our Google Maps API key in R, we used register_google().
The community districts map we downloaded is a GeoJSON file, so we imported it into R using geojson_read() from the geojsonio library.
We plotted a visna graph to investigate the missing-value patterns in the variables involved in the following data analysis.
For the row patterns, we observe that most rows have no missing values. The majority of rows that do have missing values are missing only the variable transit.
For the column patterns, we find that 8 columns have missing values. The variable transit has the most, with about 17% of its values missing. Since “transit” is a field that property owners must fill in manually, we suspect that some owners simply did not bother to fill it in. The other 7 variables, namely comments, location_scores (denoted as “locatin_scrs” in the plot), checkin_scores (denoted as “checkin_scrs” in the plot), communication_scores (denoted as “cmmnctn_scrs” in the plot), clean_scores, accuracy_scores (denoted as “accuracy_scrs” in the plot) and review_scores (denoted as “rvw_scores” in the plot), have missing values in only a few rows, so their missing-value counts are not visible in the bar graph at the bottom.
The first aspect we are going to analyze is the room type. Here are the four types of rooms in all the Airbnb listings:
entire home/apt - an entire place, usually includes a bedroom, a bathroom, a kitchen, and a separate, dedicated entrance;
private room - the traveller has their own private room for sleeping and may share some spaces with others;
shared room - the traveller sleeps in a space shared with others and shares the entire place with other people;
hotel room.
First, we would like to have a general idea of the room type distribution across the five boroughs in New York City, and we chose to use the bar chart below to show the result.
Entire home/apt and private room are the two room types with the most offerings in all five boroughs. Together, these two major room types constitute over 90% of all the listings. In Manhattan, the number of entire home/apt offerings is about twice the number of private room offerings; in Brooklyn and Staten Island, the numbers of entire home/apt and private room offerings are about the same; in the Bronx and Queens, there are more private room offerings than entire home/apt offerings.
After analyzing the quantitative distribution of room types, we are also interested in how the different room types are distributed spatially, so we plot the following spatial distributions of room types in the five boroughs.
We find that in Manhattan, the majority of the private rooms are located in uptown areas, and most of the hotel room listings are in midtown. Entire home/apt listings are spread all over Manhattan. Most of the listings in Brooklyn are in downtown Brooklyn, which makes sense as it is close to Manhattan. The majority of the listings in Queens are in Long Island City, which is also close to Manhattan. There are no hotel room listings in the Bronx or Staten Island. We can also see that listings in Staten Island are concentrated in the northeast corner. Listings in the Bronx show no concentration and are sparsely distributed across the borough, which seems intuitive, as the Bronx is less densely populated than Manhattan, Brooklyn and Queens.
Then, to find out whether there is any relationship between room type and the number of days a property is available in a year, we produced the following plot, categorized by room type.
From the graph, we can see that entire home/apt and private room have similar distributions of the number of days available. Both have a peak on the right-hand side of the histogram, corresponding to more than 300 available days (the number of listings with 360+ available days is quite high even though the corresponding bar is not tall, because that bar is narrower than all the other bars). This makes sense because some properties exist purely for renting out. For entire home/apt, there is another peak on the left-hand side of the histogram, corresponding to fewer than 30 available days. We interpret this peak as days when the owners are out of town. For private room, availabilities between 1 and 90 days all have quite high frequencies. We think this is because owners of private rooms usually have more flexibility when traveling than owners of entire homes, and therefore offer their rooms for more days in a year.
For visitors to New York City, a listing with fewer reviews than others is not necessarily less attractive. It may simply have fewer available days, so fewer people have had the chance to stay there and comment on it.
From the room type analysis above, we know that the two major room types for Airbnb listings are entire home/apt and private room. Since there is no point in comparing the price of an entire home/apt in one region with the price of a private room in another, all of the price analysis in this part is faceted on these two major room types.
First, let’s get a general idea about the price range distribution. We divided the Airbnb price range into five subranges: less than $100, between $100 and $200, between $200 and $300, between $300 and $400, and greater than $400. By counting the number of Airbnbs in such subranges grouped by the two major room types, we draw a bar chart as follows.
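The binning and counting step behind this bar chart can be sketched as follows. This is an illustrative Python version rather than the authors' R code; the band labels, the function name, the toy listings, and the treatment of boundary prices (exactly $200, $300 and $400 fall into the lower band) are assumptions, since the report does not say where boundary values go.

```python
from collections import Counter


def price_band(price):
    """Map a nightly price to one of the five subranges used in the report."""
    if price < 100:
        return "<$100"
    if price <= 200:  # assumption: boundary prices belong to the lower band
        return "$100-$200"
    if price <= 300:
        return "$200-$300"
    if price <= 400:
        return "$300-$400"
    return ">$400"


# Toy (room_type, price) pairs standing in for the cleaned listings.
listings = [
    ("Private room", 65),
    ("Private room", 89),
    ("Private room", 120),
    ("Entire home/apt", 150),
    ("Entire home/apt", 180),
    ("Entire home/apt", 250),
    ("Entire home/apt", 450),
]

# Count listings per (room type, price band); these counts are the bar heights.
counts = Counter((room_type, price_band(price)) for room_type, price in listings)

for (room_type, band), n in sorted(counts.items()):
    print(f"{room_type:16s} {band:10s} {n}")
```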
From the bar chart, we find that most of the Airbnbs in NYC are priced below $200. In particular, most of the Airbnbs under $100 are private rooms, while most of those between $100 and $200 are entire homes/apts.
Also, only a few private rooms are priced above $300. Private rooms outnumber entire homes/apts only in the price range below $100; in the other four ranges, there are more entire homes/apts than private rooms.
Since location matters a great deal for a home, we want to see the spatial distribution of prices by room type. We therefore provide two choropleth maps of the average price in each community district (please see the definition of “community district” in section 2, data source).
The first map is for entire home/apt, and the second is for private room. We observe that the entire homes and private rooms located in midtown and downtown Manhattan are on the expensive side. We think three factors contribute to the high listing prices in midtown and downtown:
there are many subway lines accessible in those areas so that tourists prefer to stay in a convenient place;
there are many tourist attractions around, such as the Statue of Liberty, Empire State Building and Times Square, so that those areas are top choices;
there are some new luxury buildings and fancy restaurants in those areas (especially in Midtown West and Tribeca).
We also find that the average price of private rooms in midtown is higher than that of entire homes in midtown, which is quite surprising. We suspect this could be because some private rooms in luxury buildings in midtown are listed, while the entire homes in such buildings are not. However, this is just our hypothesis and requires further investigation. The regions filled in grey are places like airports and parks, so no information is provided for them.
We are also interested in the quantitative distribution of prices, so we drew histograms of prices as well. However, if we keep the outliers (which we define as data points with prices higher than the 75th percentile + 1.5 × interquartile range or lower than the 25th percentile - 1.5 × interquartile range), the graph is highly skewed and hard to read, so we dropped the outliers when plotting the price histograms.
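The outlier rule above (1.5 times the interquartile range beyond the quartiles) can be written out as a small sketch. This is an illustrative Python version, not the authors' R code; the function names and the toy prices are hypothetical, and the quartiles use linear interpolation over the observed data (`method="inclusive"`), which is an assumption about how the percentiles were computed.

```python
import statistics


def outlier_fences(prices):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def drop_outliers(prices):
    """Keep only the prices that lie within the fences (inclusive)."""
    lower, upper = outlier_fences(prices)
    return [p for p in prices if lower <= p <= upper]


# Toy prices: one extreme value that would stretch the histogram's x-axis.
prices = [50, 60, 70, 80, 90, 100, 110, 1000]

print(outlier_fences(prices))
print(drop_outliers(prices))
```

With these toy prices, the quartiles are 67.5 and 102.5, so the fences are 15 and 155 and only the 1000-dollar listing is dropped.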
From the above histograms, we can see that the distribution of private room prices has less variance than that of entire home/apt prices. Prices for both room types have a unimodal distribution, with the mode for entire homes around 150 dollars and the mode for private rooms around 75 dollars. We can also see that the distribution of entire home prices is right-skewed. We suspect this is because the layouts of entire homes vary more: some entire homes are so fancy that they can command a much higher price than the mode.
As travellers may assume that Airbnbs in more accessible areas have higher prices, we would like to verify this by investigating price against transportation choices. For all the listings, we overlaid a scatterplot, a boxplot, and a violin plot on the same graph to show the Airbnb price by availability of transportation near the house, faceted by the two major room types (entire home/apt, private room). There are three categories in total: Public_Transportation (listings for which the hosts mention nearby public transportation options in their transit description), Public_Transportation_Not_Mentioned (listings for which the hosts do not mention public transportation options in their transit description), and Transportation_NA (listings for which the hosts wrote no transit description).
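The report does not say how a transit description was judged to mention public transportation. One plausible approach is simple keyword matching, sketched below in Python purely for illustration: the keyword list, the function name, and the example descriptions are hypothetical, and a real classification would likely need a broader vocabulary (for example, subway line names).

```python
import re

# Hypothetical keyword list; not taken from the report.
PUBLIC_TRANSIT_PATTERN = re.compile(
    r"\b(subways?|trains?|bus(es)?|metro|ferr(y|ies))\b", re.IGNORECASE
)


def transit_category(transit):
    """Assign a listing to one of the three categories used in the plot."""
    # No transit description at all (missing or blank).
    if transit is None or not transit.strip():
        return "Transportation_NA"
    # Description mentions at least one public transportation keyword.
    if PUBLIC_TRANSIT_PATTERN.search(transit):
        return "Public_Transportation"
    return "Public_Transportation_Not_Mentioned"


examples = [
    "The L train is two blocks away.",
    "Street parking is usually easy to find.",
    "",
    None,
]
for text in examples:
    print(repr(text), "->", transit_category(text))
```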
In the process of forming the graph, we removed the data points with extremely high prices in order to show the shapes of the violins and boxes clearly without the interruption of outliers (outliers here have the same definition as above).
From this graph, we found that the centers of the price distributions for all three categories are similar, all around $150 per night, with listings whose descriptions do not mention transportation options slightly higher than those that do. This finding is somewhat surprising, because people may assume that houses with more transportation choices nearby would be priced higher than those without. However, we find that mentioning transportation options is not associated with higher prices. When selecting houses on Airbnb, travelers apparently do not need to worry about paying more for more convenient transportation options.
Apart from room type and price, reviews/comments are another area that travellers pay a lot of attention to. Thus, this is the third aspect we want to analyze.
First, we analyzed the average review scores distribution by the five boroughs using the following boxplot.
From the above graph, we are able to conclude the following three main points:
The median review scores are very similar across the five boroughs. The medians are all around 95 out of 100, which indicates that the typical quality of listings in New York City is quite satisfying.
While the review scores in all five boroughs have outliers on the lower end, Brooklyn, Manhattan and Queens have some extremely low scores (lower than 40). This makes sense, as the majority of the listings are in these three boroughs, so extreme scores are more likely to appear there.
We also find that the 25th percentile of review scores is above 90 in all five boroughs, so visitors who filter for scores of at least 90 still have about 75% of the properties to choose from.
Now let’s look at the detailed comments that travellers have posted on the Airbnb app and analyze their wording to get a sense of which sentiments and types of emotion typically appear.
Since there are about 1 million comments across all Airbnb listings, a graph of all of them would be overcrowded. To overcome this, we sampled about 10% of the data and analyzed the sample. Since many comments contain emoji and non-ASCII characters, we transformed these into ASCII characters. We then extracted only the comments in English for analysis.
We analyzed all the words used in the comments over 2009-2019 to assign them into different emotion categories, including anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. We counted the percentage of words that belong to each emotion category for every year, and the emotion words usage over time is shown in the following graph.
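The yearly emotion shares can be computed along the following lines. This Python sketch is an illustration only: the tiny lexicon, the function name, and the example comments are hypothetical stand-ins for the emotion lexicon and review data actually used, and taking the year's emotion-tagged words as the denominator is an assumption (the report could equally have divided by all words).

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicon: word -> emotion categories it expresses.
TOY_LEXICON = {
    "great": ["joy"],
    "recommend": ["trust"],
    "clean": ["trust"],
    "excited": ["anticipation", "joy"],
    "dirty": ["disgust"],
    "scary": ["fear"],
}


def emotion_shares_by_year(comments):
    """Return {year: {emotion: percentage of that year's emotion words}}."""
    counts = defaultdict(Counter)
    for year, text in comments:
        for word in re.findall(r"[a-z']+", text.lower()):
            # A word may belong to several categories; each one is counted.
            for emotion in TOY_LEXICON.get(word, ()):
                counts[year][emotion] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {emotion: 100 * n / total for emotion, n in counter.items()}
    return shares


comments = [
    (2018, "Great place, would recommend"),
    (2018, "Clean and great"),
    (2019, "Dirty room"),
]
shares = emotion_shares_by_year(comments)
for year in sorted(shares):
    print(year, shares[year])
```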
It is easy to see that over the 10-year period, most people’s comments about Airbnb contain words that express trust, joy, and anticipation, all of which are positive sentiments. The most frequent emotion is trust, which accounts for more than 30% of the words used. This finding indicates that people were satisfied with the quality of Airbnbs in NYC and seldom expressed negative emotions (e.g. fear, disgust, sadness) about their stays. This is consistent with the boxplot of review scores presented earlier: both graphs indicate that people are, in general, satisfied with their Airbnb experience in NYC.
In the project, we include a total of approximately 30,000 Airbnb listings in the analysis. The map below contains summary information for each of them. On the map, the circles indicate the number of Airbnb listings in the surrounding area. Clicking on a circle zooms the map in to that area, where you can see more circles, until you reach a blue pin. A blue pin shows the exact location of an Airbnb. Clicking on the pin shows the name of the home and some basic information about it, including room type, average price, and average review score. From the infobox, users can decide whether they would like to see more information about the house. If they are interested, they can click the name of the Airbnb to be taken to its Airbnb webpage.
Shiny Link: https://xiayunj.shinyapps.io/Airbnb_NYC_All/
To explore the average price and review scores in each community district, we provide the following interactive map.
As price is one of the most important factors in people’s choice of Airbnb, the project has discussed several aspects of Airbnb prices. In the interactive part, we additionally provide a heat map of average price, in which users can see housing prices in each community district at a glance. Clicking on a district of interest lists detailed information, including the average price per night, the average customer review score, and the total number of Airbnbs in the district, which can guide users’ choices.
Travellers may also be interested in which attractions and spots are worth visiting around each neighbourhood, so this map also lists the famous attractions in each community district. Clicking the name of an attraction in the infobox takes travellers to the attraction’s Wikipedia page.
The regions filled in grey are places like airports and parks (exact information about a grey region is shown when you click on it). Since no Airbnb can exist in those areas, no information about rating, price, or number of houses is provided.
The previous finding about review scores by borough shows that review scores in all five boroughs are mostly above 90 out of 100. To explore the review scores within each community district, we provide the following interactive heat map, in which each community district is colored according to its average customer review score. From the heat map, users can see at a glance how Airbnb users rate each district.
Apart from the overall high ratings across all five boroughs, a traveller may also want to know the detailed ratings, including the accuracy-of-description score, the cleanliness score, the check-in score, the communication score, and the location score. Clicking on a district of interest shows all related scores: the average overall rating score (in percent) as well as the average detailed rating scores for these five aspects (out of 10). Users can examine the detailed evaluation of Airbnbs in the area and decide whether the district is where they want to stay.
After looking at the interactive plot, travellers could get an understanding about which area is more suitable for them.
Shiny Link: https://xiayunj.shinyapps.io/Airbnb_NYC_Price_Rating/
In this project, we investigated Airbnb listings in New York City and arrived at three key findings:
There are four room types in New York City. Private rooms and entire homes constitute the majority of the listings. The distributions of the number of available days in a year for both entire homes and private rooms have two peaks: one above 300 days and another below 100 days.
The price of most Airbnb listings in New York City falls in the range [0,300]. Prices in Manhattan are slightly higher than in the other four boroughs. In addition, more convenient public transportation does not appear to raise Airbnb prices.
People who have booked Airbnbs in New York City are mostly satisfied with their stays, regardless of which borough they stayed in and which year they visited NYC.
Finally, we used our data and findings to build two interactive maps. With these two Shiny apps, Airbnb seekers can not only get an idea of which area to stay in, but also filter Airbnb listings by price and room type and see the selections on a map of New York City.
Limitations:
Recall that we drew a random sample of about 10% of the roughly 1 million reviews for analysis. With more computational power, we would be able to run the sentiment analysis on all the comments instead of on a sample.
In the interactive parts, we added clickable links that navigate to each listing’s Airbnb webpage. However, since our dataset is not kept up to date, some listings may become outdated over time. When users click on such expired links, they are directed to the Airbnb home page instead of the listing’s page.
Future direction:
In the future, we would like to analyze how the Airbnb price relates to crime rates in different districts in NYC and draw maps to show the findings.
Lessons learned:
This project was a great opportunity for us to enhance our technical and soft skills. Working with multiple data files helped us strengthen our skills in data collection, cleaning and manipulation. Through the project, we also learned how to obtain Google API keys; how to use GitHub effectively; how to use different R libraries, especially shiny; and how to do text mining and sentiment analysis in R. Apart from the hard skills, we also learned how to work effectively in a team. For example, we gained experience in brainstorming with teammates, in understanding and reconciling different opinions, and in sharing ideas and giving constructive feedback.